{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from tools import show_subtitle"
   ]
  },
  {
   "source": [
    "# Ch8 分析句子结构\n",
    "\n",
    "前面的章节重点关注词的处理：识别、分析结构、分配词汇类别、获取词汇的含义、识别词序列或者n-grams的模式\n",
    "\n",
    "本章的学习目标：\n",
    "\n",
    "1.  使用形式化语法来描述无限的句子集合的结构\n",
    "2.  使用句法树来表示句子结构\n",
    "3.  使用解析器来分析句子，并且自动构建语法树\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 8.1 一些语法困境"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.1.1 语言数据和无限的可能性\n",
    "\n",
    "本章中将采用“生成文法”的形式化框架，其中一种“语言”被认为仅仅是所有合乎文法的句子的大集合，而文法只是一个形式化符号，可以用于“生成”这个集合的成员。\n",
    "\n",
    "文法的目的是给出一个明确的语言描述，使用 `S → S and S` 形式的递归产生式"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.1.2 普遍存在的歧义"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "groucho_grammar = nltk.CFG.fromstring('''\n",
    "S -> NP VP\n",
    "PP -> P NP\n",
    "NP -> Det N | Det N PP | \"I\"\n",
    "VP -> V NP | VP PP\n",
    "Det -> \"an\" | \"my\"\n",
    "N -> \"elephant\" | \"pajamas\"\n",
    "V -> \"shot\"\n",
    "P -> \"in\"\n",
    "''')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S\n  (NP I)\n  (VP\n    (VP (V shot) (NP (Det an) (N elephant)))\n    (PP (P in) (NP (Det my) (N pajamas)))))\n--------------- >第 2 个结构< ---------------\n(S\n  (NP I)\n  (VP\n    (V shot)\n    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))\n"
     ]
    }
   ],
   "source": [
    "# 基于一种文法解析句子，可能会解析出两种结构\n",
    "sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']\n",
    "parser = nltk.ChartParser(groucho_grammar)  # 图解析\n",
    "for i, tree in enumerate(parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i+1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "source": [
    "## 8.2 文法的用途"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.2.1 超越 n-grams\n",
    "\n",
    "-   成分结构：是词与词结合在一起组成的单元。\n",
    "-   词汇的可替代性：在符合语法规则的句子中，长的词序列可以被一个更短的词序列替代，并且替代后不会导致句子不合语法规则\n",
    "    -   形成单元的每个序列都可以被单独的词替换。\n",
    "    -   短语的方法类别标签\n",
    "        -   名词短语（NP）\n",
    "        -   动词短语（VP）\n",
    "        -   介词短语（PP）\n",
    "\n",
    "句子长度是任意的，因此短语结构树的深度也是任意的。因为Sec7.4只能产生有限深度的结构，所以分块方法并不适合用于句法分析。\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 8.3 上下文无关文法（context-free grammars，CFG）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.3.1 一种简单的文法"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex8-1 一个简单的上下文无关文法的例子\n",
    "grammar1 = nltk.CFG.fromstring(\"\"\"\n",
    "S -> NP VP\n",
    "VP -> V NP | V NP PP\n",
    "PP -> P NP\n",
    "V -> \"saw\" | \"ate\" | \"walked\"\n",
    "NP -> \"John\" | \"Mary\" | \"Bob\" | Det N | Det N PP\n",
    "Det -> \"a\" | \"an\" | \"the\" | \"my\" | \"The\"\n",
    "N -> \"man\" | \"dog\" | \"cat\" | \"telescope\" | \"park\"\n",
    "P -> \"in\" | \"on\" | \"by\" | \"with\"\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S\n  (NP (Det The) (N dog))\n  (VP\n    (V saw)\n    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))\n--------------- >第 2 个结构< ---------------\n(S\n  (NP (Det The) (N dog))\n  (VP\n    (V saw)\n    (NP (Det a) (N man))\n    (PP (P in) (NP (Det the) (N park)))))\n"
     ]
    }
   ],
   "source": [
    "# 句子剖析会出现两个符合文法规则的结果，称为结构上有歧义。这个歧义称为介词短语附着歧义。\n",
    "sent = 'The dog saw a man in the park'.split()\n",
    "rd_parser = nltk.RecursiveDescentParser(grammar1)  # 递归下降解析器 RecursiveDescentParser()\n",
    "for i, tree in enumerate(rd_parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "nltk.app.rdparser()  # 通过这个演示可以辅助理解从顶向下的回溯策略的句法剖析过程"
   ]
  },
  {
   "source": [
    "### 8.3.2 编写自己的文法"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S (NP Mary) (VP (V saw) (NP Bob)))\n"
     ]
    }
   ],
   "source": [
    "# 在文本文件创建和编辑语法会更加文法，然后可以利用函数加载到NLTK中进行解析\n",
    "grammar1 = nltk.data.load('mygrammar.cfg')\n",
    "sent = \"Mary saw Bob\".split()\n",
    "rd_parser = nltk.RecursiveDescentParser(grammar1)  # trace = 2 不知道有何作用\n",
    "for i, tree in enumerate(rd_parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "S -> NP VP\nVP -> V NP\nVP -> V NP PP\nPP -> P NP\nV -> 'saw'\nV -> 'ate'\nV -> 'walked'\nNP -> 'John'\nNP -> 'Mary'\nNP -> 'Bob'\nNP -> Det N\nNP -> Det N PP\nDet -> 'a'\nDet -> 'an'\nDet -> 'the'\nDet -> 'my'\nDet -> 'The'\nN -> 'man'\nN -> 'dog'\nN -> 'cat'\nN -> 'telescope'\nN -> 'park'\nP -> 'in'\nP -> 'on'\nP -> 'by'\nP -> 'with'\n"
     ]
    }
   ],
   "source": [
    "# 输出文法文件中的内容\n",
    "for p in grammar1.productions():\n",
    "    print(p)"
   ]
  },
  {
   "source": [
    "### 8.3.3 句法结构中的递归\n",
    "\n",
    "RecursiveDescentParser()无法处理形如X→XY的左递归产生式"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex-2：递归的上下文无关文法\n",
    "grammar2 = nltk.CFG.fromstring(\"\"\"\n",
    "S  -> NP VP\n",
    "NP -> Det Nom | PropN\n",
    "Nom -> Adj Nom | N\n",
    "VP -> V Adj | V NP | V S | V NP PP\n",
    "PP -> P NP\n",
    "PropN -> 'Buster' | 'Chatterer' | 'Joe'\n",
    "Det -> 'the' | 'a'\n",
    "N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'\n",
    "Adj  -> 'angry' | 'frightened' |  'little' | 'tall'\n",
    "V ->  'chased'  | 'saw' | 'said' | 'thought' | 'was' | 'put'\n",
    "P -> 'on'\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S\n  (NP (Det the) (Nom (Adj angry) (Nom (N bear))))\n  (VP\n    (V chased)\n    (NP\n      (Det the)\n      (Nom (Adj frightened) (Nom (Adj little) (Nom (N squirrel)))))))\n"
     ]
    }
   ],
   "source": [
    "sent = 'the angry bear chased the frightened little squirrel'.split()\n",
    "# RecursiveDescentParser()无法处理形如X→XY的左递归产生式\n",
    "rd_parser = nltk.RecursiveDescentParser(grammar2)\n",
    "for i, tree in enumerate(rd_parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S\n  (NP (PropN Chatterer))\n  (VP\n    (V said)\n    (S\n      (NP (PropN Buster))\n      (VP\n        (V thought)\n        (S (NP (Det the) (Nom (N tree))) (VP (V was) (Adj tall)))))))\n"
     ]
    }
   ],
   "source": [
    "sent = 'Chatterer said Buster thought the tree was tall'.split()\n",
    "rd_parser = nltk.RecursiveDescentParser(grammar2)\n",
    "for i, tree in enumerate(rd_parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "source": [
    "## 8.4 上下文无关文法分析\n",
    "\n",
    "解析器根据文法产生式处理输入的句子，并且建立一个或者多个符合文法的组成结构\n",
    "\n",
    "文法是一个格式良好的声明规则——实际上只是一个字符串，而不是程序。\n",
    "\n",
    "解析器是文法的解释程序，用于搜索所有的符合文法的树的空间，并且找出一棵与句子匹配的语法树\n",
    "\n",
    "两种分析算法：\n",
    "\n",
    "1.  自顶向下的递归下降分析，主要缺点：\n",
    "    -   左递归产生式，如：NP→NP PP，会进入死循环\n",
    "    -   解析器在处理不符合输入句子的词和结构时会浪费许多时间\n",
    "    -   回溯过程中可能会丢弃分析过的成分，需要再次重建\n",
    "2.  自底向上的移进归约分析，只建立与输入中的词对应的结构，对于每个子结构只建立一次\n",
    "    -   反复将下一个输入词捡到堆栈，叫做移位操作\n",
    "    -   替换前n项为一项的操作，叫做归约操作\n",
    "\n",
    "「移进——归约」解析器 ShiftReduceParser()"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "grammar1 = nltk.CFG.fromstring(\"\"\"\n",
    "S -> NP VP\n",
    "VP -> V NP | V NP PP\n",
    "PP -> P NP\n",
    "V -> \"saw\" | \"ate\" | \"walked\"\n",
    "NP -> \"John\" | \"Mary\" | \"Bob\" | Det N | Det N PP\n",
    "Det -> \"a\" | \"an\" | \"the\" | \"my\" | \"The\"\n",
    "N -> \"man\" | \"dog\" | \"cat\" | \"telescope\" | \"park\"\n",
    "P -> \"in\" | \"on\" | \"by\" | \"with\"\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))\n"
     ]
    }
   ],
   "source": [
    "sent = 'Mary saw a dog'.split()\n",
    "sr_parser = nltk.ShiftReduceParser(grammar1)\n",
    "for i, tree in enumerate(sr_parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Parsing 'Mary saw a dog'\n    [ * Mary saw a dog]\n  S [ 'Mary' * saw a dog]\n  R [ NP * saw a dog]\n  S [ NP 'saw' * a dog]\n  R [ NP V * a dog]\n  S [ NP V 'a' * dog]\n  R [ NP V Det * dog]\n  S [ NP V Det 'dog' * ]\n  R [ NP V Det N * ]\n  R [ NP V NP * ]\n  R [ NP VP * ]\n  R [ S * ]\n--------------- >第 1 个结构< ---------------\n(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))\n"
     ]
    }
   ],
   "source": [
    "# 跟踪解析的过程\n",
    "sent = 'Mary saw a dog'.split()\n",
    "sr_parser = nltk.ShiftReduceParser(grammar1,trace = 2)\n",
    "for i, tree in enumerate(sr_parser.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Warning: VP -> V NP PP will never be used\n"
     ]
    }
   ],
   "source": [
    "nltk.app.srparser()  # 通过这个演示可以辅助理解自底向上的称进归约的句法剖析过程"
   ]
  },
  {
   "source": [
    "### 8.4.3 左角落解析器\n",
    "\n",
    "是 自顶向下 和 自底向上 方法的混合体，是一个带有自底向上过滤的自顶向下的解析器\n",
    "\n",
    "左角落解析器，不会陷入左递归产生式死循环。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.4.4 符合语句规则的子串表（WFST）\n",
    "\n",
    "上述简单的解析器都存在完整性和效率问题，下面将基于图表分析：即利用动态规划算法来解决这些问题\n",
    "\n",
    "动态规划算法存储中间结果，并且在适当的时候重用，从而显著提高了效率。\n",
    "\n",
    "WFST的缺点：\n",
    "-   WFST本身不是一个分析树\n",
    "-   每个非词汇文法生产式都必须是二元的\n",
    "-   作为一个自下而上的文法，潜在地存在着浪费，因为它会在不符合文法的地方提出成分，后面又会放弃掉错误的成分\n",
    "-   WFST并不能表示句子中的结构歧义（如两个动词短语的读取）\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "N -> 'elephant'\nN\n('elephant',)\n"
     ]
    }
   ],
   "source": [
    "# 可以在文法中直接查找文本中单词所属类别\n",
    "# lhs : left-hand-side; rhs : right-hand-side\n",
    "text = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']\n",
    "productions = groucho_grammar.productions(rhs=text[3])\n",
    "for product in productions:\n",
    "    print(product)\n",
    "    print(product.lhs())\n",
    "    print(product.rhs())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex8-3 使用符合语句规则的子串表接收器\n",
    "def init_wfst(tokens, grammar):\n",
    "    numtokens = len(tokens)\n",
    "    wfst = [[None for i in range(numtokens + 1)] for j in range(numtokens + 1)]\n",
    "    for i in range(numtokens):\n",
    "        productions = grammar.productions(rhs=tokens[i])\n",
    "        wfst[i][i + 1] = productions[0].lhs()\n",
    "    return wfst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def complete_wfst(wfst, tokens, grammar, trace=False):\n",
    "    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())\n",
    "    numtokens = len(tokens)\n",
    "    for span in range(2, numtokens + 1):\n",
    "        for start in range(numtokens + 1 - span):\n",
    "            end = start + span\n",
    "            for mid in range(start + 1, end):\n",
    "                nt1, nt2 = wfst[start][mid], wfst[mid][end]\n",
    "                if nt1 and nt2 and (nt1, nt2) in index:\n",
    "                    wfst[start][end] = index[(nt1, nt2)]\n",
    "                    if trace:\n",
    "                        print(\"[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]\" %\n",
    "                              (start, nt1, mid, nt2, end, start, index[(nt1, nt2)], end))\n",
    "    return wfst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display(wfst, tokens):\n",
    "    print('\\nWFST ' + ' '.join([(\"%-4d\" % i) for i in range(1, len(wfst))]))\n",
    "    for i in range(len(wfst) - 1):\n",
    "        print(\"%d   \" % i, end=\" \")\n",
    "        for j in range(1, len(wfst)):\n",
    "            print(\"%-4s\" % (wfst[i][j] or '.'), end=\" \")\n",
    "        print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[[None, NP, None, None, None, None, None, None], [None, None, V, None, None, None, None, None], [None, None, None, Det, None, None, None, None], [None, None, None, None, N, None, None, None], [None, None, None, None, None, P, None, None], [None, None, None, None, None, None, Det, None], [None, None, None, None, None, None, None, N], [None, None, None, None, None, None, None, None]]\n\nWFST 1    2    3    4    5    6    7   \n0    NP   .    .    .    .    .    .    \n1    .    V    .    .    .    .    .    \n2    .    .    Det  .    .    .    .    \n3    .    .    .    N    .    .    .    \n4    .    .    .    .    P    .    .    \n5    .    .    .    .    .    Det  .    \n6    .    .    .    .    .    .    N    \n"
     ]
    }
   ],
   "source": [
    "tokens = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']\n",
    "wfst0 = init_wfst(tokens, groucho_grammar)\n",
    "display(wfst0, tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "\nWFST 1    2    3    4    5    6    7   \n0    NP   .    .    S    .    .    S    \n1    .    V    .    VP   .    .    VP   \n2    .    .    Det  NP   .    .    .    \n3    .    .    .    N    .    .    .    \n4    .    .    .    .    P    .    PP   \n5    .    .    .    .    .    Det  NP   \n6    .    .    .    .    .    .    N    \n"
     ]
    }
   ],
   "source": [
    "wfst1 = complete_wfst(wfst0, tokens, groucho_grammar)\n",
    "display(wfst1, tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[2] Det [3]   N [4] ==> [2]  NP [4]\n[5] Det [6]   N [7] ==> [5]  NP [7]\n[1]   V [2]  NP [4] ==> [1]  VP [4]\n[4]   P [5]  NP [7] ==> [4]  PP [7]\n[0]  NP [1]  VP [4] ==> [0]   S [4]\n[1]  VP [4]  PP [7] ==> [1]  VP [7]\n[0]  NP [1]  VP [7] ==> [0]   S [7]\n\nWFST 1    2    3    4    5    6    7   \n0    NP   .    .    S    .    .    S    \n1    .    V    .    VP   .    .    VP   \n2    .    .    Det  NP   .    .    .    \n3    .    .    .    N    .    .    .    \n4    .    .    .    .    P    .    PP   \n5    .    .    .    .    .    Det  NP   \n6    .    .    .    .    .    .    N    \n"
     ]
    }
   ],
   "source": [
    "# 显示剖析过程\n",
    "wfst1 = complete_wfst(wfst0, tokens, groucho_grammar, trace=True)\n",
    "display(wfst1, tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "output_type": "error",
     "ename": "TypeError",
     "evalue": "app() takes 0 positional arguments but 2 were given",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-29-56b36ce95b0a>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnltk\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchartparser\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgroucho_grammar\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtokens\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m: app() takes 0 positional arguments but 2 were given"
     ]
    }
   ],
   "source": [
    "nltk.app.chartparser()"
   ]
  },
  {
   "source": [
    "## 8.5 依存关系 和 依存文法\n",
    "短语结构文法：描述句子中的词和词序列的结合方式\n",
    "\n",
    "-   依存文法：描述词与其他词之间的关系\n",
    "    -    依存关系是一个中心词与其从属之间二元非对称关系。\n",
    "        -   一个句子的中心词通常是动词，其他词直接依赖于中心词或者通过某些路径依赖于中心词"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 下面是NLTK为依存文法编码的一种方式，只能捕捉依存关系信息，不能指定依存关系的类型\n",
    "groucho_dep_grammar = nltk.DependencyGrammar.fromstring(\"\"\"\n",
    "'shot' -> 'I' | 'elephant' | 'in'\n",
    "'elephant' -> 'an' | 'in'\n",
    "'in' -> 'pajamas'\n",
    "'pajamas' -> 'my'\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(shot I (elephant an (in (pajamas my))))\n--------------- >第 2 个结构< ---------------\n(shot I (elephant an) (in (pajamas my)))\n"
     ]
    }
   ],
   "source": [
    "# 依存关系图是一个投影，若所有的词都按照线性顺序书写，则用边连接这些词并且保证所有的边不交叉。\n",
    "# 一个词及其所有子节点在句子中形成了一个连续的词序列。\n",
    "pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)\n",
    "sent = 'I shot an elephant in my pajamas'.split()\n",
    "tree = []\n",
    "for i, tree in enumerate(pdp.parse(sent)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)\n",
    "tree.draw()"
   ]
  },
  {
   "source": [
    "#### 非投影的依存关系\n",
    "\n",
    "在一个成分C中决定哪个是中心词H，哪个是依赖D：\n",
    "\n",
    "1.  H决定类型C的分布；或者说C的外部句法属性取决于H\n",
    "2.  H定义C的语义类型\n",
    "3.  H必须存在，而D是可选的\n",
    "4.  H选择D并且决定它是必须的还是可选的\n",
    "5.  D的形态由H决定"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.5.2 扩大规模\n",
    "\n",
    "“玩具文法”：用于演示分析过程中关键环节的小型文法。将文法扩大到覆盖自然语言中的大型语料库非常困难。\n",
    "\n",
    "比较成功的文法项目\n",
    "\n",
    "-   词汇功能语法（LFG） Pargram项目\n",
    "-   中心词驱动短语结构文法（HPSG）LinGO项目\n",
    "-   邻接着文法XTAG的词汇化树项目"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 8.6 文法开发\n",
    "解析器根据短语结构文法在句子上建立树。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 8.6.1 树库 和 文法：使用宾州树库"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP-SBJ\n    (NP (NNP Pierre) (NNP Vinken))\n    (, ,)\n    (ADJP (NP (CD 61) (NNS years)) (JJ old))\n    (, ,))\n  (VP\n    (MD will)\n    (VP\n      (VB join)\n      (NP (DT the) (NN board))\n      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))\n      (NP-TMP (NNP Nov.) (CD 29))))\n  (. .))\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import treebank\n",
    "\n",
    "t = treebank.parsed_sents('wsj_0001.mrg')[0]\n",
    "print(t)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex 8-4 搜索树库找出句子的补语\n",
    "def filter(tree):\n",
    "    child_nodes = [child.label() for child in tree if isinstance(child, nltk.Tree)]\n",
    "    return (tree.label() == 'VP') and ('S' in child_nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(VP\n  (VBN named)\n  (S\n    (NP-SBJ (-NONE- *-1))\n    (NP-PRD\n      (NP (DT a) (JJ nonexecutive) (NN director))\n      (PP\n        (IN of)\n        (NP (DT this) (JJ British) (JJ industrial) (NN conglomerate))))))\n"
     ]
    }
   ],
   "source": [
    "VPS = [\n",
    "    subtree \n",
    "    for tree in treebank.parsed_sents() \n",
    "    for subtree in tree.subtrees(filter)]\n",
    "print(VPS[0])\n",
    "VPS[1].draw()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "%-below-level N:  ['left'] V: ['be']\n%-from-year N:  ['was'] V: ['declined', 'dropped', 'fell', 'grew', 'increased', 'plunged', 'rose', 'was']\n%-in-August N:  ['was'] V: ['climbed', 'fell', 'leaping', 'rising', 'rose']\n%-in-September N:  ['increased'] V: ['climbed', 'declined', 'dropped', 'edged', 'fell', 'grew', 'plunged', 'rose', 'slipped']\n%-in-week N:  ['declined'] V: ['was']\n%-to-% N:  ['add', 'added', 'backed', 'be', 'cut', 'go', 'grow', 'increased', 'increasing', 'is', 'offer', 'plummet', 'reduce', 'rejected', 'rise', 'risen', 'shaved', 'wants', 'yield', 'zapping'] V: ['fell', 'rise', 'slipped']\n%-to-million N:  ['declining'] V: ['advanced', 'climbed', 'cutting', 'declined', 'declining', 'dived', 'dropped', 'edged', 'fell', 'gained', 'grew', 'increased', 'jump', 'jumped', 'plunged', 'rising', 'rose', 'slid', 'slipped', 'soared', 'tumbled']\n1-to-21 N:  ['dropped'] V: ['dropped']\n1-to-33 N:  ['gained'] V: ['dropped', 'fell', 'jumped']\n1-to-4 N:  ['added'] V: ['gained']\n1-to-47 N:  ['jumped'] V: ['added', 'rose']\n1-to-point N:  ['ended'] V: ['fell', 'rose']\n3-to-17 N:  ['lost'] V: ['lost']\n500,000-in-fines N:  ['paid'] V: ['paid']\n6.9-on-scale N:  ['registered'] V: ['registered']\naccess-to-AZT N:  ['had'] V: ['had']\naccess-to-arena N:  ['permits'] V: ['lack']\nactivity-in-part N:  ['showed'] V: ['attributed']\nagreement-in-principle N:  ['reached'] V: ['reached']\nagreement-with-Inc. N:  ['announced', 'signed'] V: ['signed']\nagreement-with-creditors N:  ['reached'] V: ['nearing']\nagreement-with-regulators N:  ['presages', 'reach'] V: ['reach']\naid-to-Contras N:  ['renewing'] V: ['renewing']\nalliance-with-GM N:  ['discussing', 'wrapping'] V: ['forge', 'have', 'negotiating']\napproval-for-drug N:  ['granted'] V: ['obtain']\nattention-to-comments N:  ['paid'] V: ['paid']\nattention-to-concerns N:  ['pay'] V: ['show']\nattention-to-reports N:  ['paid'] V: ['pay']\nbid-for-company N:  ['fend', 'launch'] V: ['made', 'make']\nbid-for-million N:  ['finance'] V: ['had']\nbids-for-company N:  ['submitted'] V: ['solicit']\nbillion-in-cash N:  ['pay', 'raise'] V: ['raise']\nbillion-of-bills N:  ['sell', 'sold'] V: ['sold']\nbillion-over-years N:  ['total'] V: ['spent']\nbillion-to-billion N:  ['cause', 'place'] V: ['increased', 'rose']\nbusiness-to-firms N:  ['cutting'] V: ['give', 'transfer']\nbusiness-with-them N:  ['cease'] V: ['do']\ncap-on-amount N:  ['eliminate'] V: ['places']\ncents-to-cents N:  ['be', 'recovering'] V: ['fell', 'rose']\nchange-in-earnings N:  ['had'] V: ['had']\nchanges-for-% N:  ['measures'] V: ['measures', 'monitors']\ncharge-in-quarter N:  ['took'] V: ['had', 'included', 'incur', 'take', 'took']\ncollar-on-trading N:  ['re-establishing'] V: ['reinstating']\ncommitments-from-banks N:  ['secured', 'won'] V: ['obtained']\ncompetition-from-competitors N:  ['faced'] V: ['fend']\ncompetition-in-production N:  ['reduce'] V: ['reduce']\ncontract-for-parts N:  ['awarded', 'given', 'won'] V: ['received']\ncontract-for-support N:  ['awarded', 'issued'] V: ['received']\ncontract-from-Co. N:  ['received'] V: ['won']\ncontract-with-Warner N:  ['violates'] V: ['terminate']\ncontrol-of-Inc. N:  ['took'] V: ['seek']\ndecline-for-quarter N:  ['posted'] V: ['reported']\ndecline-in-August N:  ['followed', 'following', 'recorded'] V: ['following']\ndecline-in-earnings N:  ['alleviate', 'report', 'reported'] V: ['expects']\ndeclines-in-prices N:  ['reflect'] V: ['had']\ndisputes-with-company N:  ['resolve'] V: ['resolve']\ndomestic-production-through-July N:  ['includes'] V: ['includes']\ndrop-in-earnings N:  ['posted'] V: ['posted']\ndrop-in-profit N:  ['experienced', 'had', 'posted', 'reported', 'reporting'] V: ['posted']\nearnings-for-companies N:  ['reported'] V: ['reported']\nearnings-for-quarter N:  ['posting'] V: ['posted', 'report', 'reported']\nearnings-in-quarter N:  ['projecting'] V: ['had']\nearnings-of-million N:  ['had', 'include', 'posted', 'reported'] V: ['reported']\neffect-on-market N:  ['had'] V: ['had']\nemphasis-on-quality N:  ['be'] V: ['place']\nfinancing-for-buy-out N:  ['deliver', 'get'] V: ['obtaining']\nfloor-for-price N:  ['establishes'] V: ['providing']\nfoot-in-door N:  ['wanted'] V: ['getting']\nfunding-for-abortion N:  ['supporting'] V: ['oppose']\nfunds-for-station N:  ['including', 'providing'] V: ['includes']\ngain-from-sale N:  ['included', 'includes'] V: ['a-Includes', 'including', 'record', 'report']\ngain-in-profit N:  ['posted', 'reported'] V: ['posted']\nhead-to-head N:  ['going'] V: ['go']\nimpact-on-market N:  ['have'] V: ['has', 'have']\nimpact-on-markets N:  ['had'] V: ['have']\nimpact-on-results N:  ['have'] V: ['have']\nincome-for-quarter N:  ['announcing'] V: ['report']\nincrease-in-earnings N:  ['reported'] V: ['posted']\ninformation-from-companies N:  ['requested'] V: ['steal']\ninquiry-into-activities N:  ['conducted'] V: ['drop']\ninterest-in-company N:  ['bought', 'have', 'holds', 'owning', 'retain'] V: ['represent']\ninterest-in-metals N:  ['create'] V: ['was']\ninterest-on-loans N:  ['computing'] V: ['pay']\nloans-to-China N:  ['suspended'] V: ['resuming']\nloss-for-quarter N:  ['announced', 'have', 'post', 'posted', 'reported', 'reporting'] V: ['post', 'report', 'reported']\nloss-in-quarter N:  ['expect', 'had'] V: ['caused', 'had', 'posted', 'took']\nlosses-in-years N:  ['reported'] V: ['had']\nmarkets-in-stocks N:  ['making'] V: ['make']\nmillion-before-tax N:  ['reported'] V: ['contributed']\nmillion-for-initiative N:  ['attached'] V: ['add']\nmillion-for-stake N:  ['pay'] V: ['paid', 'pay', 'putting']\nmillion-from-funds N:  ['commit'] V: ['raises']\nmillion-from-operations N:  ['included'] V: ['reported']\nmillion-from-sale N:  ['including'] V: ['take']\nmillion-in-payments N:  ['make', 'owes', 'pay', 'receive'] V: ['fallen']\nmillion-of-debt N:  ['add', 'borrow', 'consolidate', 'convert', 'downgraded', 'includes', 'pay', 'raise'] V: ['assume']\nmillion-on-revenue N:  ['earned'] V: ['earned', 'was', 'were']\nmillion-on-sales N:  ['earned'] V: ['earned', 'reach', 'totaled', 'was', 'were']\nmillion-to-million N:  ['be', 'bills', 'cost', 'pump', 'sell', 'totaled'] V: ['declined', 'fell', 'spend', 'tumbled']\nmonth-in-time N:  ['delivered'] V: ['delivered']\nnet-on-revenue N:  ['posted'] V: ['reported']\nnothing-about-it N:  ['knew'] V: ['doing']\noffer-for-all N:  ['begun', 'make'] V: ['begin']\noffer-for-shares N:  ['began', 'extended'] V: ['launched', 'made']\noffer-for-stock N:  ['extended'] V: ['make']\noffer-from-group N:  ['rejected'] V: ['received']\noffice-in-Worth N:  ['Call'] V: ['has']\npace-with-inflation N:  ['keep', 'keeping'] V: ['keep']\npayment-on-million N:  ['make'] V: ['make']\npayments-on-debt N:  ['cover', 'make'] V: ['make']\npresident-in-charge N:  ['is', 'named'] V: ['been']\npressure-on-government N:  ['keep'] V: ['keep', 'put']\npressure-on-prices N:  ['suffered'] V: ['keep', 'put']\nprice-for-incentives N:  ['paid'] V: ['paid']\nprices-on-market N:  ['push'] V: ['bring']\nprofit-for-year N:  ['report'] V: ['report']\nprofit-from-discontinued N:  ['had'] V: ['was']\nprofit-in-quarter N:  ['indicates'] V: ['produce', 'recorded']\nprojections-for-year N:  ['slashed'] V: ['exceed']\nprovisions-for-loans N:  ['taken'] V: ['made']\nrates-to-% N:  ['boosting'] V: ['increase', 'pushed', 'raised']\nreporter-in-bureau N:  ['is'] V: ['is']\nrestrictions-on-use N:  ['waiving'] V: ['impose']\nrevenue-for-year N:  ['projected'] V: ['had']\nrevenue-in-quarter N:  ['expects'] V: ['had']\nsales-in-excess N:  ['combined'] V: ['had']\nsales-in-quarter N:  ['had'] V: ['increasing']\nsales-of-million N:  ['expected', 'had', 'has', 'have', 'posted'] V: ['had']\nsalvo-in-battle N:  ['marking'] V: ['marking']\nservices-for-customers N:  ['offering'] V: ['provide']\nshareholder-in-bank N:  ['become'] V: ['become']\nstake-in-Airlines N:  ['acquiring', 'buy', 'raise'] V: ['buy']\nstake-in-Mixte N:  ['bring'] V: ['boost']\nstake-in-Rally N:  ['hold'] V: ['had']\nstake-in-company N:  ['bought', 'building', 'built', 'buying', 'give', 'hold', 'obtaining', 'own', 'owns', 'raised', 'take'] V: ['accumulating', 'had', 'has', 'holds', 'own']\nstake-in-concern N:  ['acquires', 'lowered'] V: ['retaining']\nstake-in-unit N:  ['sell'] V: ['acquire']\nstake-in-venture N:  ['has', 'hold', 'holds'] V: ['held']\nsuit-against-Keating N:  ['press'] V: ['brought']\nswings-in-market N:  ['cause', 'create'] V: ['cause']\nsystem-for-city N:  ['design'] V: ['design']\nsystem-in-Pakistan N:  ['operate'] V: ['operate']\ntime-for-Congress N:  ['is'] V: ['buy', 'buys']\nventure-with-Co. N:  ['started'] V: ['started']\nventures-with-companies N:  ['established'] V: ['form']\nverdict-in-case N:  ['is', 'won'] V: ['won']\nvolatility-in-stocks N:  ['ignoring'] V: ['see']\nvote-on-issue N:  ['allow'] V: ['prevent']\nway-for-declines N:  ['open'] V: ['pave']\nwriter-in-York N:  ['is'] V: ['is']\nyen-to-yen N:  ['shed'] V: ['advanced', 'fell', 'gained', 'lost', 'rose']\n"
     ]
    }
   ],
   "source": [
    "# Prepositional Phrase Attachment Corpus. 介词短语附着语料库，是特别动词配价的信息源\n",
    "# 搜索语料库，找出具有固定介词和名词和介词短语对，其中介绍短语附着到VP还是NP由选择的动词决定\n",
    "from collections import defaultdict\n",
    "\n",
    "entries = nltk.corpus.ppattach.attachments('training')\n",
    "table = defaultdict(lambda: defaultdict(set))\n",
    "for entry in entries:\n",
    "    key = entry.noun1 + '-' + entry.prep + '-' + entry.noun2\n",
    "    table[key][entry.attachment].add(entry.verb)\n",
    "\n",
    "for key in sorted(table):\n",
    "    if len(table[key]) > 1:\n",
    "        print(key, 'N: ', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "key= zip-in-way\ntable[key]=  defaultdict(<class 'set'>, {'V': {'requires'}})\ntable['zip-in-way']=  defaultdict(<class 'set'>, {'V': {'requires'}})\nlen(table[key])=  1\ntable['access-to-AZT']=  defaultdict(<class 'set'>, {'V': {'had'}, 'N': {'had'}})\ntable['offer-from-group']=  defaultdict(<class 'set'>, {'V': {'received'}, 'N': {'rejected'}})\n"
     ]
    }
   ],
   "source": [
    "print(\"key=\", key)\n",
    "print(\"table[key]= \", table[key])\n",
    "print(\"table['zip-in-way']= \", table['zip-in-way'])\n",
    "print(\"len(table[key])= \", len(table[key]))\n",
    "print(\"table['access-to-AZT']= \", table['access-to-AZT'])\n",
    "print(\"table['offer-from-group']= \", table['offer-from-group'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 现代汉语中央研究院平衡语料库中的10000句已经分析的句子\n",
    "nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()"
   ]
  },
  {
   "source": [
    "### 8.6.2 有害的歧义"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "grammar = nltk.CFG.fromstring(\"\"\"\n",
    "S -> NP V NP\n",
    "NP -> NP Sbar\n",
    "Sbar -> NP V\n",
    "NP -> 'fish'\n",
    "V -> 'fish'\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S (NP fish) (V fish) (NP (NP fish) (Sbar (NP fish) (V fish))))\n--------------- >第 2 个结构< ---------------\n(S (NP (NP fish) (Sbar (NP fish) (V fish))) (V fish) (NP fish))\n"
     ]
    }
   ],
   "source": [
    "# 当fish的数量为（3，5，7...），分析树的数量是（1，2，5...），这是Catalan数\n",
    "tokens = ['fish'] * 5\n",
    "cp = nltk.ChartParser(grammar)\n",
    "for i, tree in enumerate(cp.parse(tokens)):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)\n",
    "tree.draw()"
   ]
  },
  {
   "source": [
    "### 8.6.3 加权语法"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex8-5: 宾州树库样本中 give 和 gave 的用法\n",
    "# 检查所有包含 give 介词格和双宾语结构的实例\n",
    "def give(t):\n",
    "    result = t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'  # give 双宾语结构\n",
    "    result = result and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')  # give 介绍格\n",
    "    result = result and ('give' in t[0].leaves() or 'gave' in t[0].leaves())  # give和gave的用法\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sent(tree):\n",
    "    return ' '.join(token for token in tree.leaves() if token[0] not in '*-0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_node(t, width):\n",
    "    output = '%s %s: %s / %s: %s' % (\n",
    "        sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))\n",
    "    if len(output) > width:\n",
    "        output = output[:width] + '...'\n",
    "    print(output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "gave NP: the chefs / NP: a standing ovation\n",
      "give NP: advertisers / NP: discounts for maintaining or increasing ad sp...\n",
      "give NP: it / PP-DTV: to the politicians\n",
      "gave NP: them / NP: similar help\n",
      "give NP: them / NP: \n",
      "give NP: only French history questions / PP-DTV: to students in a Europe...\n",
      "give NP: federal judges / NP: a raise\n",
      "give NP: consumers / NP: the straight scoop on the U.S. waste crisis\n",
      "gave NP: Mitsui / NP: access to a high-tech medical product\n",
      "give NP: Mitsubishi / NP: a window on the U.S. glass industry\n",
      "give NP: much thought / PP-DTV: to the rates she was receiving , nor to ...\n",
      "give NP: your Foster Savings Institution / NP: the gift of hope and free...\n",
      "give NP: market operators / NP: the authority to suspend trading in futu...\n",
      "gave NP: quick approval / PP-DTV: to $ 3.18 billion in supplemental appr...\n",
      "give NP: the Transportation Department / NP: up to 50 days to review any...\n",
      "give NP: the president / NP: such power\n",
      "give NP: me / NP: the heebie-jeebies\n",
      "give NP: holders / NP: the right , but not the obligation , to buy a cal...\n",
      "gave NP: Mr. Thomas / NP: only a `` qualified '' rating , rather than ``...\n",
      "give NP: the president / NP: line-item veto power\n",
      "(VP\n",
      "  (VB give)\n",
      "  (NP (DT the) (NN president))\n",
      "  (NP (JJ line-item) (NN veto) (NN power))\n",
      "  (ADVP (-NONE- *T*-2)))\n"
     ]
    }
   ],
   "source": [
    "for tree in nltk.corpus.treebank.parsed_sents():\n",
    "    for t in tree.subtrees(give):\n",
    "        print_node(t, 72)\n",
    "print(t)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex8-6: 定义一个概率上下文无关文法（PCFG）\n",
    "# 只是演示了一个概率上下文无关文法的作用，也是个玩具文法\n",
    "grammar = nltk.PCFG.fromstring(\"\"\"\n",
    "S    -> NP VP              [1.0]\n",
    "VP   -> TV NP              [0.4]\n",
    "VP   -> IV                 [0.3]\n",
    "VP   -> DatV NP NP         [0.3]\n",
    "TV   -> 'saw'              [1.0]\n",
    "IV   -> 'ate'              [1.0]\n",
    "DatV -> 'gave'             [1.0]\n",
    "NP   -> 'telescopes'       [0.8]\n",
    "NP   -> 'Jack'             [0.2]\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >第 1 个结构< ---------------\n(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)\n"
     ]
    }
   ],
   "source": [
    "viterbi_parser = nltk.ViterbiParser(grammar)\n",
    "for i, tree in enumerate(viterbi_parser.parse(['Jack', 'saw', 'telescopes'])):\n",
    "    show_subtitle(f\"第 {i + 1} 个结构\")\n",
    "    print(tree)\n",
    "tree.draw()    "
   ]
  },
  {
   "source": [
    "## 8.7 小结\n",
    "-   句子的内部结构用树来表示。组织结构的显著特点是：递归、中心词、补语和修饰语\n",
    "-   文法是可能句子的集合的紧凑型特性：一棵树是符合语法规则的，或者文法是可以授权给一棵树的\n",
    "-   文法是一种用于描述给定短语是否可以被分配给特定成分或者依存结构的形式化模型\n",
    "-   给定一组句法类别，上下文无关文法使用生产式表示某个类型 $A$ 的短语是如何被分析成较小的序列 $\\alpha_1,\\cdots,\\alpha_n$\n",
    "-   依存文法使用产生式指定给定的中心词的依赖是什么\n",
    "-   当句子有一个以上的文法分析时，就会产生句法歧义（如介词短语附着歧义）\n",
    "-   解析器是寻找一个或者多个与符合语法规则句子相对应树的程序\n",
    "-   下降递归解析器是一个简单的自顶而下的解析器，利用文法产生式，递归可扩展的开始符号，并且深度匹配输入的句子。\n",
    "    -   下降递归解析器不能处理左递归产生式。\n",
    "    -   下降递归解析器只能盲目扩充类别而不检查是否与输入字符串兼容，导致效率低下。\n",
    "-   「移进——规约解析器」是一个简单的自底向上的解析器，输入被移到堆栈中，并且尝试匹配堆栈顶部的项目和文法产生式右边的部分。\n",
    "    -   移进——规约解析器，不能保证有效地解析输入，即使确实存在\n",
    "    -   移进——规约解析器，能够建立子结构，但是不检查子结构是否与全部文法一致。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}